Search CORE

CiteSeerX

arXiv.org e-Print Archive

Multi-Resource List Scheduling of Moldable Parallel Jobs under Precedence Constraints

Author: Perotin Lucas
Raghavan Padma
Sun Hongyang
Publication venue
Publication date: 13/06/2021
Field of study

The scheduling literature has traditionally focused on a single type of resource (e.g., computing nodes). However, scientific applications in modern High-Performance Computing (HPC) systems process large amounts of data, hence have diverse requirements on different types of resources (e.g., cores, cache, memory, I/O). All of these resources could potentially be exploited by the runtime scheduler to improve the application performance. In this paper, we study multi-resource scheduling to minimize the makespan of computational workflows comprised of parallel jobs subject to precedence constraints. The jobs are assumed to be moldable, allowing the scheduler to flexibly select a variable set of resources before execution. We propose a multi-resource, list-based scheduling algorithm, and prove that, on a system with

d

types of schedulable resources, our algorithm achieves an approximation ratio of

1.619d+2.545\sqrt{d}+1

for any

d

, and a ratio of

d+O(\sqrt[3]{d^2})

for large

d

. We also present improved results for independent jobs and for jobs with special precedence constraints (e.g., series-parallel graphs and trees). Finally, we prove a lower bound of

d

on the approximation ratio of any list scheduling scheme with local priority considerations. To the best of our knowledge, these are the first approximation results for moldable workflows with multiple resource requirements

Co-scheduling algorithms for high-throughput workload execution

Author: Aupy Guillaume
Benoit Anne
Raghavan Padma
Robert Yves
Shantharam Manu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

International audienceThis paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are executed one by one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint that the total number of processors assigned within a pack does not exceed the maximum number of available processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization problem , and propose several heuristics that exhibit very good performance on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload completion time (40% improvement on average over traditional scheduling) and to faster response times (50% improvement). Hence co-scheduling increases system throughput and saves energy, leading to significant benefits from both the user and system perspectives

Co-scheduling algorithms for cache-partitioned systems

Author: Aupy Guillaume
Benoit Anne
Pottier Loïc
Raghavan Padma
Robert Yves
Shantharam Manu
Publication venue: HAL CCSD
Publication date: 08/11/2016
Field of study

Cache-partitioned architectures allow subsections of theshared last-level cache (LLC) to be exclusively reserved for someapplications. This technique dramatically limits interactions between applicationsthat are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications.Key scheduling questions are: (i) which proportionof cache and (ii) how many processors should be given to each application?Here, we assign rational numbers of processors to each application,since they can be shared across applications through multi-threading.In this paper, we provide answers to (i) and (ii) for perfectly parallel applications.Even though the problem is shown to be NP-complete, we give key elements to determinethe subset of applications that should share the LLC(while remaining ones only use their smaller private cache). Building upon these results,we design efficient heuristics for general applications.Extensive simulations demonstrate the usefulness of co-schedulingwhen our efficient cache partitioning strategies are deployed

Crossref

Conjugate gradient sparse solvers: performance-power characteristics

Author: Ingyu Lee
Korad Malkowski
Mary Jane Irwin
Padma Raghavan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

We characterize the performance and power attributes of the conjugate gradient (CG) sparse solver which is widely used in scientific applications. We use cycle-accurate sim-ulations with SimpleScalar and Wattch, on a processor and memory architecture similar to the configuration of a node of the BlueGene/L. We first demonstrate that substantial power savings can be obtained without performance degra-dation if low power modes of caches can be utilized. We next show that if Dynamic Voltage Scaling (DVS) can be used, power and energy savings are possible, but these are realized only at the expense of performance penalties. We then consider two simple memory subsystem optimiza-tions, namely memory and level-2 cache prefetching. We demonstrate that when DVS and low power modes of caches are used with these optimizations, performance can be im-proved significantly with reductions in power and energy. For example, execution time is reduced by 23%, power by 55 % and energy by 65 % in the final configuration at 500MHz relative to the original at 1GHz. We also use our codes and the CG NAS benchmark code to demonstrate that performance and power profiles can vary significantly depending on matrix properties and the level of code tun-ing. These results indicate that architectural evaluations can benefit if traditional benchmarks are augmented with codes more representative of tuned scientific applications.

CiteSeerX

Crossref

Ordonnancement de tâches parallèles sous multiples ressources : Ordonnancement de Listes vs ordonnancement de Packs

Author: Aupy Guillaume
Elghazi Redouane
Gainaru Ana
Raghavan Padma
Sun Hongyang
Publication venue: HAL CCSD
Publication date: 08/01/2018
Field of study

Scheduling in High-Performance Computing (HPC) has been traditionally centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications start to drive novel architectures and new computing frameworks to support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands the scheduling solutions to also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications. In this paper, we study the problem of scheduling HPC applications while exploring the availability of multiple types of resources that could impact their performance. The goal is to minimize the overall execution time, or makespan, for a set of moldable tasks under multiple-resource constraints. Two scheduling paradigms, namely, list scheduling and pack scheduling, are compared through both theoretical analyses and experimental evaluations. Theoretically, we prove, for several algorithms falling in the two scheduling paradigms, tight approximation ratios that increase linearly with the number of resource types. As the complexity of direct solutions grows exponentially with the number of resource types, we also design a strategy to indirectly solve the problem via a transformation to a single-resource-type problem, which can significantly reduce the algorithms' running times without compromising their approximation ratios. Experiments conducted on Intel Knights Landing with two resource types (processor cores and high-bandwidth memory) and simulations designed on more resource types confirm the benefit of the transformation strategy and show that pack-based scheduling, despite having a worse theoretical bound, offers a practically promising and easy-to-implement solution, especially when more resource types need to be managed.L'ordonnancement en Calcul Haute-Performance est traditionnellement centré autour des ressources de calculs (processeurs, coeurs). Suite à l'explosion des quantités de données dans les applications scientifiques, de nouvelles architectures et nouveaux paradigmes de calcul apparaissent pour soutenir plus efficacement les calculs dirigés par l'accès aux données. Nous présentons ici des solutions algorithmiques qui prennent en compte cette multitude de ressources

On-the-fly scheduling vs. reservation-based scheduling for unpredictable workflows

Author: Aupy Guillaume
Gainaru Ana
Huo Yuankai
Landman Bennett,
Raghavan Padma
Sun Hongyang
Publication venue: 'SAGE Publications'
Publication date: 01/01/2019
Field of study

International audienceScientific insights in the coming decade will clearly depend on the effective processing of large datasets generated by dynamic heterogeneous applications typical of workflows in large data centers or of emerging fields like neuroscience. In this paper, we show how these big data workflows have a unique set of characteristics that pose challenges for leveraging HPC methodologies, particularly in scheduling. Our findings indicate that execution times for these workflows are highly unpredictable and are not correlated with the size of the dataset involved or the precise functions used in the analysis. We characterize this inherent variability and sketch the need for new scheduling approaches by quantifying significant gaps in achievable performance. Through simulations, we show how on-the-fly scheduling approaches can deliver benefits in both system-level and user-level performance measures. On average, we find improvements of up to 35% in system utilization and up to 45% in average stretch of the applications, illustrating the potential of increasing performance through new scheduling approaches

STS-k: A Multilevel Sparse Triangular Solution Scheme for NUMA Multicores

Author: Aupy Guillaume
Benoit Anne
Booth Joshua
Kabir Humayun
Raghavan Padma
Robert Yves
Publication venue: HAL CCSD
Publication date: 01/08/2015
Field of study

We consider techniques to improve the performance of parallel sparse triangular solution on non-uniform memory architecture multicores by extending earlier coloring and level set schemes for single-core multiprocessors. We develop sts-k, where k represents a small number of transformations for latency reduction from increased spatial and temporal locality of data accesses. We propose a graph model of data reuse to inform the development of sts-k and to prove that computing an optimal cost schedule is NP-complete. We observe significant speed-ups with sts-3 on 32-core Intel Westmere-EX and 24-core AMD `MagnyCours' processors.Incremental gains solely from the 3-level transformations in sts-3 for a fixed ordering, correspond to reductions in execution times by factors of 1.4 (Intel) and 1.5 (AMD) for level sets and 2 (Intel) and 2.2 (AMD) for coloring. On average, execution times are reduced by a factor of 6 (Intel) and 4 (AMD) for sts-3 with coloring compared to a reference implementation using level sets

Crossref

Ordonnancement avec tolérance aux pannes pour des tâches parallèles à nombre de processeurs programmable

Author: Benoit Anne
Le Fèvre Valentin
Perotin Lucas
Raghavan Padma
Robert Yves
Sun Hongyang
Publication venue: HAL CCSD
Publication date: 01/01/2021
Field of study

We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms, LPA-List and Batch-List, both of which use the List strategy to schedule the jobs. Without knowing a priori how many times each job will fail, LPA-List relies on a local strategy to allocate processors to the jobs, while Batch-List schedules the jobs in batches and allows only a restricted number of failures per job in each batch. We prove new approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mixed model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics. Overall, our best algorithm is within a factor of 1.6 of a lower bound on average over the entire set of experiments, and within a factor of 4.2 in the worst case.Ce rapport étudie l’ordonnancement résilient de tâches sur des plateformes de calcul à haute performance. Dans le problème étudié, il est possible de choisir le nombre constant de processeurs effectuant chaque tâche, en déterminant le temps d’exécution de ces dernières selon différent modèles de rendement. Nous décrivons des algorithmes dont l’objectif est deminimiser le temps total d’exécution, sachant que les tâches sont susceptibles d’échouer et de devoir être ré-effectuées à chaque erreur. Ce problème est donc une généralisation du cadre classique où toutes les tâches sont connues à priori et n’échouent pas. Nous décrivons un algorithme d’ordonnancement par listes de priorité, et prouvons de nouvelles bornes d’approximation pour trois modèles de rendement classiques (roofline, communication, Amdahl, power, monotonic, et un modèle qui mélange ceux-ci). Nous décrivons également un algorithme d’ordonnancement par lots, au sein desquels les tâches pourront échouer un nombre limité de fois, et prouvons alors de nouvelles bornes d’approximation pour des rendements quelconques. Enfin, nous effectuons des expériences sur un ensemble complet d’exemples pour comparer les niveaux de performance de différentes variantes de nos algorithmes, significativement meilleurs que les algorithmes simples usuels. Notre meilleure heuristique est en moyenne à un facteur

1.6

d’une borne inférieure de la solution optimale, et à un facteur

4.2

dans le pire cas